We conduct large-scale studies on 'human attention' in Visual Question Answering (VQA) to understand where humans choose to look to answer questions about images. We design and test multiple game-inspired novel attention-annotation interfaces that require the subject to sharpen regions of a blurred image to answer a question. Thus, we introduce the VQA-HAT (Human ATtention) dataset. We evaluate attention maps generated by state-of-the-art VQA models against human attention both qualitatively (via visualizations) and quantitatively (via rank-order correlation). Overall, our experiments show that current attention models in VQA do not seem to be looking at the same regions as humans.
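The quantitative comparison described above can be illustrated with a minimal sketch: flatten a model's attention map and a human attention map and compute their Spearman rank-order correlation. The function name, array shapes, and data below are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of the rank-order correlation evaluation.
# Attention map sizes and values are illustrative only.
import numpy as np
from scipy.stats import spearmanr

def rank_correlation(model_attention, human_attention):
    """Flatten both attention maps and return their Spearman rank correlation."""
    rho, _ = spearmanr(model_attention.ravel(), human_attention.ravel())
    return rho

# Example with random 14x14 attention grids standing in for real maps.
rng = np.random.default_rng(0)
model_map = rng.random((14, 14))
human_map = rng.random((14, 14))
print(rank_correlation(model_map, human_map))
```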